There has occasionally been some doubt over the viability of data science careers, even within (or perhaps especially within) the field of data science itself: is it secure? Is it worth investing in? Is “data science” really even a career and not a buzzword for hours and hours of Microsoft Excel until you die? The best answer we can give is that data science is a broad term for a wide variety of statistical applications. Job titles range from “Data Engineer” to “Data Analyst” or simply “Software Engineer”, and tools range from the usual R, Python, and Excel to an increasingly obscure amalgamation of libraries and packages. Job-seeking data scientists (such as the author) would benefit from seeing a breakdown of the data behind data science jobs: which salaries, degrees, tools, and locations are commonly listed. This project provides some exploratory data analysis (EDA), basic natural language processing (NLP), and geospatial analysis of data analyst job listings scraped from Glassdoor. It addresses two research questions: which skills, degrees, and job titles are most common across which locations, and which city would be the most price-optimal to work in.
Three datasets were used in this project: a dataset of job listings, a dataset of average rent prices by U.S. city, and a dataset of U.S. cities and their coordinates.
The dataset of job listings was obtained from GitHub user picklesueat (link to dataset) and contains about 2500 rows, each row being an individual job listing scraped in July 2020 from Glassdoor.com, a major job board site. The columns of the dataset describe each listing’s:
The job listing dataset is licensed under the MIT License, which allows free use and modification (license text available here).
The second dataset, for rent prices, comes from Zillow’s Rent Index housing data and was obtained through Kaggle.com (link to dataset). Zillow is an online real estate marketplace company that has published its estimated rent by U.S. city from January 2010 to January 2017; the January 2017 values were used as they are the most recent. The data from Zillow, while publicly available, is not listed as being under a specific license.
The third dataset, of U.S. cities and their coordinates, is from the simplemaps U.S. cities database (link to dataset). The dataset is licensed under Creative Commons and is available for use and modification.
We begin some exploratory data analysis with some histograms.
“Data Engineer” appears to be by far the most common job title, followed by the very similar title “Big Data Engineer”. We can also see that “Senior Data Engineer” and “Sr. Data Engineer”, the 4th and 5th most common titles, are actually the same title. The most common title without the word “Data” is “Software Engineer” at 3rd, with 60 occurrences, followed by “Machine Learning Engineer” at 7th with 12 occurrences.
One might expect STEM jobs to be concentrated in the Bay Area, California, but within our data the jobs’ cities are most commonly in Texas, which accounts for the top 2 cities and 4 of the top 11. California’s two appearances are both in Southern California. We can check whether the city-level view overstates Texas by graphing the locations by state only.
The order of states is not very different from what the histogram of cities suggested: Texas is still in the lead, followed by Arizona, Illinois, Pennsylvania, and then California.
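The state-level tally described above can be sketched in a few lines. This is only an illustration: it assumes each location is stored as a "City, ST" string, and the sample values are hypothetical rather than rows from the actual dataset.

```python
from collections import Counter


def state_of(location: str) -> str:
    """Return the text after the last comma, assumed to be the state code."""
    # Assumption: locations look like "City, ST". A value without a comma
    # (for example "Remote") is returned whole, upper-cased.
    return location.rsplit(",", 1)[-1].strip().upper()


# Hypothetical sample locations, not taken from the job listing dataset.
locations = ["Austin, TX", "Houston, TX", "Phoenix, AZ", "Chicago, IL", "San Diego, CA"]

# Count listings per state and show the most frequent ones.
state_counts = Counter(state_of(loc) for loc in locations)
print(state_counts.most_common(3))
```

Counting by state rather than by city merges listings that are spread across several cities in the same state.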
Some basic natural language processing was done on the job listings’ descriptions using Python’s Natural Language Toolkit (NLTK) library. The process used to extract keywords for each job’s desired tools and level of education is described here.
A list of likely desired tools (such as programming languages and libraries) was created from each listing’s description, along with a value indicating whether a minimum level of education (Bachelor’s, Master’s, Doctorate) was mentioned. We can view the overall frequencies of skills and education levels in the following histograms.
Note that “aw” is “Amazon Web Services (AWS)”: the final S was removed by the stemming step, which treated it as a plural ending so that plural and singular forms count as the same word. Python and SQL are the most frequently mentioned tools, followed by “cloud” (a blanket term for using cloud services), Java, AWS, and three Apache frameworks.
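To make the “aw” artifact concrete, here is a minimal sketch of keyword extraction with a crude plural-stripping step. It is not the project's script: the `TOOLS` list, `crude_stem`, and `extract_tools` are hypothetical names, and the one-line stemmer is only a stand-in for the NLTK stemming used in the project.

```python
import re
from collections import Counter

# Hypothetical keyword list; the list used in the project is not shown here.
TOOLS = {"python", "sql", "java", "aws", "spark", "cloud"}


def crude_stem(token: str) -> str:
    """Strip one trailing 's' (a crude stand-in for a real stemmer)."""
    if len(token) > 2 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


# Stem the keywords as well, so "aws" in a listing matches its stem "aw".
STEMMED_TOOLS = {crude_stem(tool) for tool in TOOLS}


def extract_tools(description: str) -> set:
    """Return the stemmed tool keywords mentioned in one description."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return {crude_stem(tok) for tok in tokens} & STEMMED_TOOLS


# Hypothetical descriptions, used only to show the counting step.
descriptions = [
    "Experience with Python, SQL and AWS required",
    "Strong SQL skills; Java or Python a plus",
]
tool_counts = Counter()
for text in descriptions:
    tool_counts.update(extract_tools(text))
print(tool_counts.most_common())
```

Because each description contributes a set, a tool is counted at most once per listing, however often the listing repeats it.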
Below is a histogram of mentioned minimum education level in job listings, showing that a Bachelor’s degree is the most commonly listed.
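One way the minimum education level could be pulled from a description is a small set of regular expressions checked from the lowest degree upward. The patterns and the `minimum_education` helper below are hypothetical and far from exhaustive; treating the lowest mentioned degree as the minimum is itself an assumption.

```python
import re

# Hypothetical patterns, ordered from lowest to highest degree.
# Real listings phrase degrees in many more ways than these cover.
DEGREE_LEVELS = [
    ("Bachelor's", re.compile(r"\bbachelor", re.IGNORECASE)),
    ("Master's", re.compile(r"\bmaster['’]?s\b", re.IGNORECASE)),
    ("Doctorate", re.compile(r"\bph\.?\s?d\b|\bdoctora(?:te|l)\b", re.IGNORECASE)),
]


def minimum_education(description: str):
    """Return the lowest degree level mentioned, or None if none is found."""
    for level, pattern in DEGREE_LEVELS:
        if pattern.search(description):
            return level
    return None


print(minimum_education("Bachelor’s degree required; Master's preferred"))
print(minimum_education("5+ years of experience with ETL pipelines"))
```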
We can use the rent index data to compare the estimated salaries with the estimated cost of living. The rent index data used here is limited to cities that appear in the job listing data, which consists of 38 unique cities. Since the rent prices are monthly and the salaries are yearly, the rent values in the boxplot below have been multiplied by 12 for comparison.
Note that the Glassdoor salary estimate is given as a range in the original dataset, e.g. “$90K - $115K”; this project used the midpoint between those two values. It appears that the average data scientist will be hired at just shy of six figures a year, with a median Glassdoor salary estimate of $94K, though the distribution is slightly right-skewed. Rent is more skewed than salary, and there is some overlap between the two, although the data suggest that a location with an abnormally high rent and a lower-quartile salary is unlikely.
We can compare the average estimated salary with the average estimated rent in each of the 88 unique cities for which we have data, shown in the scatterplot below.
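The midpoint calculation can be sketched as follows. The function name `salary_midpoint` is hypothetical, and the sketch assumes every bound is written as a dollar sign, whole digits, and a `K` suffix, as in the example above.

```python
import re
from typing import Optional


def salary_midpoint(estimate: str) -> Optional[float]:
    """Return the midpoint, in thousands of dollars, of a range like '$90K - $115K'."""
    # Collect every "$<digits>K" bound found in the string.
    bounds = [int(number) for number in re.findall(r"\$(\d+)K", estimate)]
    if len(bounds) != 2:
        # Missing or malformed estimates yield None rather than a guess.
        return None
    low, high = bounds
    return (low + high) / 2


print(salary_midpoint("$90K - $115K"))  # 102.5
```

Returning `None` for anything that is not a two-bound range keeps malformed rows out of the salary statistics.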
Here, city average refers to the average salary of data analyst job listings in our dataset, not the average salary of all jobs in the city, and individual refers to an individual job listing.
Most city average salaries are below the 30% line (the line where yearly rent equals 30% of yearly salary), so we can conclude that data analyst jobs pay fairly well relative to rent. We observe a cluster of cities above the line, around the $120K salary and $50K yearly rent (about $4K monthly) area, consisting of cities around the Bay Area, California. It appears that Westlake, TX has an unusually high rent. The town is a very small suburb with a population of under a thousand and a median salary of about $128K, so it is likely just an unusually well-off area.
There are a few outliers: smaller towns with apparently very favorable salary-to-rent ratios. Most six-figure salaries under the 30% line appear to be located in Texas or Southern California.
Finally, we plot our cities on a map to get a better view of the varying rent indices, salaries, and job locations. Each city’s popup for the number of jobs also shows its three most common job titles.
We can make the same observations as earlier: Texas has the highest number of jobs, followed by California. Rent is typically lower in Texas, while salaries are similar to those in California.
From this rudimentary financial perspective, the career outlook for data scientists doesn’t seem so bad: most job offers only require a bachelor’s degree and have salaries within the 30% ratio for rent. Most job titles are seeking some form of engineer with knowledge of 2-3 programming languages, cloud services, and preferably experience with Apache frameworks. Jobs with an ideal rent-to-salary ratio (around or under 30%) are commonly located in Texas or Southern California.
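The comparison against the 30% line comes down to a single ratio: twelve months of rent divided by the yearly salary. A minimal sketch, with hypothetical city names and figures (the first pair roughly mirrors the $120K salary and $4K monthly rent cluster described above):

```python
RENT_THRESHOLD = 0.30  # the 30% rent-to-salary line


def rent_to_salary_ratio(monthly_rent: float, yearly_salary: float) -> float:
    """Return yearly rent as a fraction of yearly salary."""
    return (monthly_rent * 12) / yearly_salary


# Hypothetical (monthly rent, yearly salary) pairs, not values from the datasets.
cities = {
    "High-rent city": (4000, 120_000),
    "Low-rent city": (1500, 95_000),
}

ratios = {name: rent_to_salary_ratio(rent, pay) for name, (rent, pay) in cities.items()}
above_line = [name for name, ratio in ratios.items() if ratio > RENT_THRESHOLD]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0%} of salary goes to rent")
print("Above the 30% line:", above_line)
```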
This project gives a very rough overview of what the data science career search might look like, but searching for a job of course requires far more nuance. The variables explored in the project were limited to basic job qualifications, salary, and location, but in reality there are far more factors that impact job selection, ranging from the company itself to any number of personal circumstances to the intangible influence of networking.
There are also several shortcomings in the project methodology that reduce the reliability of its results. Most immediately, the job listing data is from 2020 and the rental data from 2017, so neither reflects the changes caused by COVID. One might now be able to live in a low-rent city and earn a high-end salary by working remotely. The web-scraped job listing data would also have benefited from being cleaned; we saw earlier how “Senior Data Engineer” and “Sr. Data Engineer” were treated as different job titles.
Future work on this subject could include scraping one’s own data for more recent listings, as the source of the job listing dataset also provides the script used to scrape it. The Python script currently used for NLP is both unoriginal and highly inefficient, relying on pandas rather than NumPy, so the NLP step is another candidate for improvement. It would also be of interest to explore the relationships between skills, salary, and location more rigorously, and to see if latitude and longitude have any association with the types of careers desired.
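As an illustration of the kind of cleaning that would merge such titles, the sketch below expands a couple of common abbreviations before counting. It was not part of the project; the abbreviation table and the `normalize_title` helper are hypothetical, and a real cleanup would need many more rules.

```python
import re
from collections import Counter

# Hypothetical abbreviation table; extend as needed.
ABBREVIATIONS = {"sr": "Senior", "jr": "Junior"}


def normalize_title(title: str) -> str:
    """Expand 'Sr.'/'Jr.' style abbreviations and collapse extra whitespace."""
    def expand(match):
        return ABBREVIATIONS[match.group(1).lower()]

    # Match the abbreviation as a whole word, with an optional trailing period.
    expanded = re.sub(r"\b(sr|jr)\b\.?", expand, title, flags=re.IGNORECASE)
    return " ".join(expanded.split())


# Hypothetical sample titles.
titles = ["Senior Data Engineer", "Sr. Data Engineer", "Data Engineer"]
title_counts = Counter(normalize_title(t) for t in titles)
print(title_counts.most_common())
```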